<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta content="text/html; charset=ISO-8859-1" http-equiv="content-type">
  <title>The 12Dicts Word Lists, release 5</title>
  <meta content="Alan Beale" name="author">
</head>
<body style="color: rgb(0, 0, 0); background-color: rgb(236, 236, 193);" alink="#000088" link="#0000ff" vlink="#ff0000">
<h1>Release 5 of the 12dicts word lists</h1>
<p><big>This file describes release 5 of the 12dicts word
list package, released on June 3, 2007. &nbsp;Almost no changes
have been made to the files from previous editions, and so
you should refer to the <a href="readme.html">readme.html</a>
document from the previous&nbsp;release for information on them. The only changes
to existing files were to&nbsp;correct a small number of embarrassing errors.</big></p>
<p><big>It is probably valuable to present here the matrix of the
lists and their features updated to release 5.</big></p>
<p>
<table border="1">
  <tbody>
    <tr>
      <th><big></big></th>
      <th><big>neol2007</big></th>
      <th><big>3esl</big></th>
      <th><big>6of12</big></th>
      <th><big>2of12</big></th>
      <th><big>2of4brif</big></th>
      <th><big>5desk</big></th>
      <th><big>2+2lemma<br>
2+2gfreq</big></th>
      <th><big>2of12inf</big></th>
    </tr>
    <tr>
      <td><big>Size</big></td>
      <td><big>373</big></td>
      <td><big>21877</big></td>
      <td><big>32153</big></td>
      <td><big>41236</big></td>
      <td><big>60387</big></td>
      <td><big>61406</big></td>
      <td><big>80431</big></td>
      <td><big>81536</big></td>
    </tr>
    <tr>
      <td><big>Abbreviations</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
    </tr>
    <tr>
      <td><big>Acronyms</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
    </tr>
    <tr>
      <td><big>American English</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
    </tr>
    <tr>
      <td><big>British English</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
    </tr>
    <tr>
      <td><big>Hyphenations</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
    </tr>
    <tr>
      <td><big>Inflections</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
    </tr>
    <tr>
      <td><big>Names</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
    </tr>
    <tr>
      <td><big>Phrases</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>Y</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
      <td><big>N</big></td>
    </tr>
  </tbody>
</table>
</p>
<p><big>The new lists, in brief, are as follows:</big></p>
<ol>
  <li>
    <p><big>The 2+2lemma list is composed of the words in the 2of12inf
and 2of4brif lists, lemmatized. &nbsp;The word "lemmatized" is a rare
word, which you will find in none of these lists, but what it means is
that this list is formatted as a collection of word sets, each set
composed of a headword and some number (possibly zero) of closely related
words. &nbsp;</big></p>
  </li>
  <li>
    <p><big>The 2+2gfreq list contains exactly the same words as
2+2lemma, but they have been arranged by frequency groups, using data
supplied by Google on the frequency of English words on the World Wide Web.</big></p>
  </li>
  <li>
    <p><big>The neol2007 list contains a number of new and/or trendy
words which you might choose to add as appropriate to the other lists,
if you are concerned about including the coolest (or is it hottest?)
buzzwords of the 21st century.</big></p>
  </li>
</ol>
<h2>The 2+2lemma list</h2>
<p><big>The list 2+2lemma.txt contains the words in the 2of12inf.txt
and 2of4brif.txt lists, plus a few additional words from&nbsp;3esl.txt.
&nbsp;Also, the new words from the neol2007.txt list (see&nbsp;<a href="#The_neol2007_list">below</a>)
have been added, marked with a + if they would not have otherwise been
included. (Marking the new words permits them to be removed if it is
preferred for these lists to be in synch with the older 12dicts lists.)
Finally, British forms of words in
the 2of12inf list not already in the 2of4brif list have been added.
&nbsp;Words
marked with a % in the 2of12inf list ("Scrabble inflections") have
however been omitted, with the result that, despite augmentation from other lists, this list in fact contains
fewer words than 2of12inf.txt.</big></p>
<p><big>The 2+2lemma list is not formatted as a simple list of words.
&nbsp;It is composed of entries of 1 or 2 lines each. &nbsp;The first
line contains a headword, and the second line, which is indented if
present, contains an alphabetized list of related words. &nbsp;A simple example:</big></p>
<p><big><span style="font-family: monospace;">funny</span><br style="font-family: monospace;">
<span style="font-family: monospace;">&nbsp;&nbsp;&nbsp; funnier, funnies, funniest, funnily, funniness</span></big></p>
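<p><big>For readers who want to process this layout programmatically, here is a minimal Python sketch of a reader for it. &nbsp;It is an illustration rather than part of the package: the function name parse_lemma_entries is made up, it assumes that any line beginning with whitespace belongs to the preceding headword, and it does not handle the cross-reference notation described later in this section.</big></p>

```python
def parse_lemma_entries(text):
    """Map each headword to its (possibly empty) list of related words.

    Assumption: an unindented line is a headword, and an indented line
    holds the comma-separated related words of the preceding headword.
    """
    entries = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Indented line: related words for the most recent headword.
            if current is not None:
                entries[current].extend(
                    word.strip() for word in line.split(",") if word.strip()
                )
        else:
            current = line.strip()
            entries[current] = []
    return entries


# The first entry is the example shown above; "gorse" stands in for an
# entry that has no related words (a one-line entry).
sample = "funny\n    funnier, funnies, funniest, funnily, funniness\ngorse\n"
entries = parse_lemma_entries(sample)
print(entries["funny"])
print(entries["gorse"])
```

<p><big>Keeping the headword as the dictionary key makes it easy to ask which words belong to a given set, or to flatten the whole file back into a plain word list.</big></p>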
<p><big>The list of related words contains three sorts of entries.</big></p>
<ol>
  <li>
    <p><big>Inflections.</big></p>
  </li>
  <li>
    <p><big>Variant spellings.</big></p>
  </li>
  <li>
    <p><big>Words formed with certain suffixes.</big></p>
  </li>
</ol>
<p><big>In addition to true variant spellings such as&nbsp;"grey" for
"gray" and "thru" for "through", item 2 also includes&nbsp;words
which, though pronounced differently, are clearly variants
of the headword. &nbsp;Thus, "hooray" is considered a variant of
"hurrah" (but mere synonyms like "furze" and "gorse" remain
independent).</big></p>
<p><big>Item 3 is based on a small list of suffixes, producing closely
and consistently related words. &nbsp;These suffixes are -ful, -ish,
-less, -like, -ly, -more and -ness. &nbsp;-ally is also allowed, if
there is no -al word to apply the -ly suffix to. &nbsp;(For instance,
"basically" is considered to be derived from "basic", because there is
no word "basical".) &nbsp;When one of these suffixes is used in an
unusual way, the resulting word is considered independent. &nbsp;For
instance, "likely" is not considered to be derived from "like", nor
"bashful" from "bash". &nbsp;There are some rather difficult questions
here, such as how closely "slavish" is related to "slave", or
"sluggish" to "slug". &nbsp;In general, I have chosen the course of
least surprise by treating&nbsp;such pairs as independent.</big></p>
<p><big>Here are some other notes on the determination of what words are related.</big></p>
<p><big>Certain uses of the suffixes -ed and -s are treated as
inflections, even though technically they are not. &nbsp;Thus, "talented"
is treated as derived from "talent", and "optics" from "optic".</big></p>
<p><big>Words ending with the suffix -ability/ibility are treated as relatives of the corresponding -able/ible word.</big></p>
<p><big>Sometimes, the choice of which variant to treat as the headword
is somewhat arbitrary. &nbsp;I have consistently chosen an American
spelling over a British spelling here. &nbsp;This has some effect on
the number of headwords. &nbsp;I treat "cheque" as a variant of
"check", whereas, to an observer with a British bias, they would no doubt be separate headwords.</big></p>
<p><big>No distinction is made of different meanings of the same word,
even when they are so different that&nbsp;dictionaries list them
separately. "wind" the noun and "wind" the verb are considered as a
single word, as are "second" the adjective, "second" the noun and
"second" the verb.</big></p>
<p><big>It may sometimes happen that two different words
have the same inflection ("putting" derives both from "putt" and "put";
"holier" relates to "holey" as well as "holy"), or that an inflection
is a headword in its own right (as with "wound", the past tense of
"wind", or "crooked", the past tense of "crook"). &nbsp;These
situations are noted in the 2+2lemma list as cross-references to the
alternate headword. There are two specific situations, which might not be obvious, where
inflections are treated as different words.
&nbsp;These occur when a present participle (-ing form) or a -ness word has a
plural inflection, as with "meaning" and "kindness". &nbsp;Such words
are always made headwords, even when the relationship to the original
root is very close. &nbsp;Here is an example showing how
cross-references are indicated:</big></p>
<p><big style="font-family: monospace;">base<br>
&nbsp;&nbsp;&nbsp; based, baseless, basely, baseness, baser, bases -&gt; [basis], basest, basing</big></p>
<p><big>Almost always, a given word has only one cross-reference - the exception is the incredible tangle shown in the example below:</big></p>
<p><big style="font-family: monospace;">slue -&gt; [slough]<br>
&nbsp;&nbsp;&nbsp; slew -&gt; [slay, slew, slough], slewed, slewing,
slews -&gt; [slew, slough], slued, slues -&gt; [slough], sluing
</big></p>
<p><big>where four uncommon words, mostly pronounced "sloo", have become thoroughly confused.</big></p>
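<p><big>The cross-reference notation can be read mechanically. &nbsp;The Python sketch below is only an illustration, under the assumption that a cross-reference always has the form "word -&gt; [target, target, ...]" exactly as in the examples above. &nbsp;The function names are invented for this example.</big></p>

```python
import re


def split_outside_brackets(line):
    """Split on commas, except commas inside a [...] cross-reference."""
    # A comma lies inside brackets if a "]" follows it before any "[".
    parts = re.split(r",(?![^\[]*\])", line)
    return [part.strip() for part in parts if part.strip()]


def parse_item(item):
    """Return (word, targets); targets is empty without a cross-reference."""
    word, arrow, refs = item.partition("->")
    if not arrow:
        return word.strip(), []
    targets = [t.strip() for t in refs.strip().strip("[]").split(",") if t.strip()]
    return word.strip(), targets


related_line = (
    "slew -> [slay, slew, slough], slewed, slewing, "
    "slews -> [slew, slough], slued, slues -> [slough], sluing"
)
headword = parse_item("slue -> [slough]")
related = [parse_item(item) for item in split_outside_brackets(related_line)]
print(headword)
for word, targets in related:
    print(word, targets)
```

<p><big>Splitting on commas first and only then looking for the arrow keeps the two concerns separate, at the cost of the slightly cryptic lookahead in the regular expression.</big></p>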
<h2>The 2+2gfreq list</h2>
<p><big>The 2+2gfreq.txt file contains exactly the same words as the
2+2lemma list, but with the headwords arranged approximately by the
order of their frequency of use. &nbsp;The "g" in the name stands for
both "grouped" and "Google". &nbsp;Here's how it was put together.</big></p>
<p><big>In 2006, Google published a mammoth corpus of word and phrase
frequency data extracted from the English language Web. &nbsp;(See the <a href="http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html">announcement</a>.) &nbsp;The
2+2gfreq list was made by accumulating the frequency counts for all of
the words associated with a single headword of 2+2lemma.txt, in all of
their spellings.
&nbsp;(There were in general multiple spellings for each word because
Google distinguished words on the basis of capitalization, so that
"price", "Price" and "PRICE" were counted separately.) &nbsp;The
resulting data was sorted by frequency, and then grouped into bands
based on powers of 2. &nbsp;That is, the band of least
frequency&nbsp;contained words which occurred&nbsp;200 to 400 times in
the Google data,
the next band contained words occurring 400 to 800 times, and so on.
&nbsp;After&nbsp;the words were grouped in this fashion, each band was
sorted
alphabetically by headword, and a separator line was inserted between
adjacent bands. &nbsp;There were 27 bands in all, plus a small number
of words which did not appear in Google's data at all.</big></p>
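<p><big>The accumulation and banding just described can be sketched in a few lines of Python. &nbsp;This is an illustration under stated assumptions, not the program actually used to build the list: the function names and the sample counts are invented, headwords are assumed to be lower case, and the band numbers here simply count upward from the least frequent band, which need not match the numbering used in the file.</big></p>

```python
from collections import defaultdict


def headword_totals(raw_counts, entries):
    """Sum the counts of every form of each headword, ignoring case."""
    folded = defaultdict(int)
    for token, count in raw_counts.items():
        folded[token.lower()] += count  # "price", "Price", "PRICE" merge
    totals = {}
    for headword, related in entries.items():
        forms = {headword, *related}
        totals[headword] = sum(folded.get(form, 0) for form in forms)
    return totals


def band_index(count, base=200):
    """0 for base..2*base-1, 1 for 2*base..4*base-1, and so on.

    Returns None for counts below the base (assumed to be unbanded).
    """
    if count < base:
        return None
    return (count // base).bit_length() - 1


def group_into_bands(totals):
    """Group headwords by band, each band sorted alphabetically."""
    bands = defaultdict(list)
    for headword, total in totals.items():
        bands[band_index(total)].append(headword)
    return {band: sorted(words) for band, words in bands.items()}


# Invented counts, for illustration only.
raw_counts = {"price": 1000, "Price": 500, "PRICE": 100, "prices": 300, "Gorse": 250}
entries = {"price": ["priced", "prices", "pricing"], "gorse": []}
totals = headword_totals(raw_counts, entries)
bands = group_into_bands(totals)
print(totals)
print(bands)
```

<p><big>Because each band covers a doubling of frequency, halving or doubling a word's count moves it by exactly one band, which is worth keeping in mind when reading the caveats that follow.</big></p>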
<p><big>One might reasonably inquire why the data was not simply
presented in frequency order. &nbsp;The reason is that I think this
would have attributed more significance to the data than is actually
there. &nbsp;As I will explain in the following paragraphs, the Google
data is only loosely representative of true English word frequencies, and
further inaccuracies have been introduced by my own procedures. &nbsp;I
think that dividing the words into
frequency bands as I have done here is less prone to misinterpretation than some other
procedure purporting to offer greater accuracy.</big></p>
<p><big>Let me explore here some of the reasons for not taking the
Google frequency data, and my procedures for processing it, too
seriously.</big></p>
<ol>
  <li>
    <p><big>The Web is full of gibberish. &nbsp;The phrases
"NEVERENDING sweetie - animal thread" and&nbsp;"REDRUM REDRUM REDRUM
REDRUM REDRUM" were each found by Google slightly more often than the
phrase "over the past
10 years" (all about 160,000 times). &nbsp;I speculate that at least
some of this is explained by the technique of setting up many
replicated web pages linking to one another, with the hope of creating
the semblance of a frequently referenced site. &nbsp;At any rate, one
suspects
after seeing this example that the frequency of the word "sweetie" in
the Google data might be somewhat higher than the frequency in
literature and conversation. &nbsp;One quarter of the total use of that
word as counted by Google came from repetitions of the above&nbsp;not-exceptionally-lucid phrase!</big></p>
  </li>
  <li>
    <p><big>The Web is biased towards certain kinds of content, and the
vocabulary of that content is overrepresented. &nbsp;Three such biases
are towards advertising and marketing, computers, and pornography. &nbsp;The
advertising bias is illustrated by the surprisingly high frequency of
words such as "credit", "sale", "brand" and "discount". &nbsp;The
computer bias is illustrated by words such as "click", "online", "icon"
and "network". &nbsp;And the pornographic bias is illustrated by the
high frequency of "anal", "nude", "teen" and other words less savory.
&nbsp;Perhaps my favorite example is that "nostdinc" (a compiler option
under the Linux operating system) occurs more frequently on the Web
than the common word "responsibility".</big></p>
  </li>
  <li>
    <p><big>Google's techniques for identification of English language
text are somewhat imperfect, and additionally, some pages contain text
in multiple languages. &nbsp;As a result, extremely common words in
other languages, such as "la", "el" and "en", show up in the Google
data with considerably higher frequency than is credible for English.</big></p>
  </li>
  <li>
    <p><big>When I correlated Google's data with the 2+2lemma list, I
chose to ignore capitalization. &nbsp;This was necessary, as it would
appear that capitalization on the Web is random, or at least beyond
simple explanation. &nbsp;According to Google, the most common form of
the word "borscht" on the Web is "Borscht", and of "mesh" is "MeSH".
&nbsp;But there is a side-effect to this, which is that some words seem
to be unnaturally frequent because of having the same form
as&nbsp;commonly used names. &nbsp;Words in which this effect can be
observed include "john", "china", "bush",&nbsp;"yahoo" and "august".
&nbsp;And, perhaps surprisingly, the frequency of the headword "we" is
exaggerated, as the most frequent form of "us" is "US", and probably
most occurrences of "US" refer to the country rather than to the
pronoun.</big></p>
  </li>
  <li>
    <p><big>As noted above, the lemmatization of 2+2lemma introduces
certain ambiguities - should the word "putting" count for the "put" or
the "putt" headword? &nbsp;Since there is no way of knowing, when I
accumulated the frequencies I used the expedient technique of dividing
the count evenly between all the possible headwords. &nbsp;This assumes
that "put" and "putt" are equally probable interpretations, which is of
course wrong. &nbsp;Two excellent examples involving words of very high
frequency are two forms of the verb "to be": "are" and "art".
&nbsp;"are", as a noun, is&nbsp;an obscure unit of measurement,
but&nbsp;it is credited with half the total count for the word, and
thereby&nbsp;ends up in frequency band 5. &nbsp;(Not only are hardly
any uses of "are"&nbsp;noun uses, but, almost certainly, most
occurrences of the plural "ares" are actually references to the Greek
god of war rather than to the unit.) &nbsp;"art" illustrates the other
side of the problem. &nbsp;It is an archaic form of the verb "to be",
and in this form is likely quite uncommon on the Web. &nbsp;The noun
"art" shows up in band 8, but if half its count had not been credited
to "be", it would be in band 7.</big></p>
  </li>
</ol>
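<p><big>The even division described in item 5 can be made concrete with a short Python sketch. &nbsp;It is only an illustration of the arithmetic: the function name and the counts are invented, and the mapping from each word form to its possible headwords is assumed to have been built already from the 2+2lemma cross-references.</big></p>

```python
from collections import defaultdict


def credit_counts(form_counts, form_to_headwords):
    """Credit each form's count to its headwords, split evenly among them."""
    totals = defaultdict(float)
    for form, count in form_counts.items():
        headwords = form_to_headwords.get(form, [])
        for headword in headwords:
            totals[headword] += count / len(headwords)
    return dict(totals)


# "putting" is ambiguous between "put" and "putt".
form_to_headwords = {
    "put": ["put"],
    "putt": ["putt"],
    "putting": ["put", "putt"],
}
# Invented counts, for illustration only.
form_counts = {"put": 1000, "putt": 40, "putting": 600}
totals = credit_counts(form_counts, form_to_headwords)
print(totals)
```

<p><big>With these made-up numbers, the rare headword "putt" receives most of its total from the shared form "putting", which is the same distortion that affects "are" and "art".</big></p>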
<p><big>Now, after all that, you may be thinking that the Google
frequency data, and this use of it, is silly beyond measure, and I
don't want to leave you with that impression. &nbsp;With a few glaring
exceptions, like "are" and "john", I find the division of the 2+2lemma
data into frequency bands to be quite reasonable, and I'm making it
available for exactly that reason. &nbsp;I don't know whether there are
any practical uses for this data or not, but English word frequency
information has always been of interest to "word nerds" like myself,
and I offer this approximation on that basis.</big></p>
<h2><a name="The_neol2007_list"></a>The neol2007 list</h2>
<p><big>As I noted above, the existing 12dicts lists have not been
updated in the last 4 years. &nbsp;I have been working on other
projects, and I think it is unlikely that these lists will see any
further changes, except perhaps for minor error corrections. &nbsp;However,
language does not remain static, and the 2007 editions of the 12dicts
source dictionaries will contain some words which either did not exist,
or were not important enough to list, when their previous editions were
printed.</big></p>
<p><big>In lieu of trying to bring these lists up to date, I
am&nbsp;publishing the file neol2007.txt, which contains newly popular
words and phrases obtained from various sources which seem to me to be
possibly worth adding. &nbsp;(Some of these words are already in one or
more of the larger lists, but are now common enough to belong in the
smaller ones as well.) &nbsp;Many of these words relate to two of the
most important trends of recent years, the conflict between the
developed world and Islamic extremism (called by many "the war on
terror") and the growing importance of the Internet in our daily lives.
&nbsp;A few of the words, such as "break-dance" and "dotcom", actually
date back to the previous century - their omission from previous
releases of 12dicts reflects the fact that lexicographers&nbsp;never
manage to keep up. &nbsp;After all, it took them 20 years to recognize the word
"mosh".</big></p>
<p><big> neol2007.txt is divided into two parts, a section of
individual uncapitalized words and their inflections (as recorded in
2of12inf.txt and 2+2lemma.txt), and a section of additional hyphenated
words, phrases and acronyms. &nbsp;(Observe the use of the % suffix to
denote "Scrabble inflections".) &nbsp;Depending on your application for
the 12dicts lists, you can choose to ignore these words, or add them in
part or in whole to the other lists. &nbsp;(I note again that these
words have already been added to the 2+2lemma and 2+2gfreq lists,
marked with a plus sign to facilitate their removal.) &nbsp;I intend,
if there are further revisions to 12dicts, to provide an appropriate
neol20xx file each time.</big></p>
<h2>My other projects</h2>
<p><big>Since the previous release of 12dicts, I have been fooling
around with English spelling reform. &nbsp;One of the results of this
activity is the development of CAAPR and ABCD, both of which may be
downloaded from my&nbsp;website, <a href="http://www.wyrdplay.org">www.wyrdplay.org</a>.
&nbsp;CAAPR is the Combined Anglo-American Pronunciation Reference, a
fancy name for a bi-dialectal pronunciation dictionary whose word list
is derived primarily from the 12dicts 6of12 list. &nbsp;ABCD, Alan's
Basic Codes with Diacritics, is also a pronunciation dictionary, of a
somewhat different sort - the notation is designed to clarify when a
word&nbsp;is spelled in accordance with normal English spelling
patterns (as with "fault" or "tunnel"), and when it is not (as with
"fought" or "colonel"). &nbsp;Though these files were developed as a
result of my interest in spelling reform, they may be of interest to other
"word nerds" unconcerned with that particular quixotic pastime.</big></p>
<p><big>Click the following links to <a href="http://www.wyrdplay.org/AlanBeale/CAAPR-ref-12.html">CAAPR</a> and <a href="http://www.wyrdplay.org/AlanBeale/ABCD-def-12.html">ABCD</a> if interested.</big></p>
<h2>Conclusions</h2>
<p><big>In the previous editions of 12dicts, I suggested that you write
to me (biljir@pobox.com) and let me know what use you were making of
12dicts. &nbsp;I will repeat that request now. &nbsp;I have been
delighted to see the interest in these lists for projects ranging from
interactive games to literacy programs. &nbsp;And I have been
particularly pleased to occasionally hear of first-year Computer
Science assignments&nbsp;specifying a 12dicts list rather than
/usr/dict/words for their input. &nbsp;Keep up the good work, and let
me know what you're doing.&nbsp;(Oh, and please put "12dicts" in the
subject line when you email me. &nbsp;This will allow me to easily
notice your mail even if it is misclassified by an overzealous filter as spam. &nbsp;Speaking of
spam, the publication of my email address in this package has led to a
marked increase in the amount of spam I receive and, ironically, much
of it contains subject lines which appear to have been
extracted at random from my own lists. This is a use of 12dicts of which I
do not approve!)</big></p>
<p><big>A note on "licensing": 2+2lemma.txt and 2+2gfreq.txt were
derived from 2of12inf.txt, which was itself derived in part from Kevin
Atkinson's AGID, described in the file agid.txt. &nbsp;I place no
additional restrictions on the use of these files beyond those imposed
by agid.txt. &nbsp;I release neol2007.txt into the public domain.</big></p>
<p><big>- Alan Beale -</big></p>
</body>
</html>
